-
Notifications
You must be signed in to change notification settings - Fork 80
Torch integration #692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Torch integration #692
Conversation
…d style consistency (#725) - [x] Fix license text: "MIT license" → "MIT License" in multiple files - [x] Rename files with "_2" suffix and update references - [x] Add missing license headers - [x] Fix header guards to follow MSCCLPP_EXT_<FILE_NAME>_HPP_ format - [x] Fix enum naming consistency - [x] CommResult enum to CamelCase - [x] CollectiveBufferMode enum to CamelCase (Any, InPlace, OutOfPlace) - [x] AlgorithmType enum to CamelCase (Native, DSL) - [x] Fix comment in src/core/gpu_utils.cc - [x] Fix LogSubsys::COUNT case in src/core/include/logger.hpp - [x] Add explanatory comment - [x] Add [[fallthrough]] attribute - [x] Apply clang-format - [x] Remove _codeql_detected_source_root file and add to .gitignore - [x] Update documentation paths - [x] Fix NCCL library paths: build/apps/nccl/ → build/lib/ - [x] Fix test binary paths: ./test/ → ./bin/ <!-- START COPILOT CODING AGENT TIPS --> --- 💬 We'd love your input! Share your thoughts on Copilot coding agent in our [2 minute survey](https://gh.io/copilot-coding-agent-survey). --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: chhwang <[email protected]>
|
/azp run mscclpp-ut |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run mscclpp-ut |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run mscclpp-ut |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 3 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 3 pipeline(s). |
|
/azp run mscclpp-ut |
|
Azure Pipelines successfully started running 1 pipeline(s). |
chhwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's tackle comments in another PR
| __all__ += algorithm.__all__ | ||
| __all__ += comm.__all__ | ||
| __all__ += compiler.__all__ | ||
| __all__ += buffer.__all__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unneeded
| from mscclpp._mscclpp import ( | ||
| Algorithm as _Algorithm, | ||
| DslAlgorithm as _DslAlgorithm, | ||
| AlgorithmType as _AlgorithmType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python-binded objects would be named differently in the future (e.g., adding Cpp* prefix) to avoid confusion
| ) | ||
| ).hexdigest() | ||
|
|
||
| plan_dir = os.environ.get("MSCCLPP_EXECUTION_PLAN_DIR", Path.home() / ".cache/mscclpp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- All env variables should be collected and documented in
env.cpp, even if it's used only at Python level. - Is execution plan a part of "cache", meaning, this belongs to the system not to the user?
- I'd prefer
MSCCLPP_CACHE_DIRenv and we internally decide what to put where.
| recompilation. | ||
| The cache location can be configured via the `MSCCLPP_NATIVE_CACHE_DIR` | ||
| environment variable (defaults to `~/.cache/mscclpp/native`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer MSCCLPP_CACHE_DIR env and we internally decide what to put where.
| rocm_home = os.environ.get("ROCM_HOME") | ||
| return os.path.join(rocm_home, "bin/hipcc") if rocm_home else "hipcc" | ||
| else: | ||
| cuda_home = os.environ.get("CUDA_HOME") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not MSCCLPP_CUDA_HOME/MSCCLPP_ROCM_HOME?
|
|
||
| from .algorithm_collection_builder import * | ||
|
|
||
| __all__ = algorithm_collection_builder.__all__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unneeded
| break; | ||
| case DataType::FP8_E4M3: | ||
| case DataType::FP8_E5M2: | ||
| // FP8 is not supported in CUDA execution kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we warn or throw here?
|
/azp run |
|
Azure Pipelines successfully started running 3 pipeline(s). |
|
Will resolve all the comments in another PR |
Reorganize current native algorithm implementation and DSL algorithm implementation.
Provide unified API for DSL algo and native algo and provide interface to tune the algo
Provide interface for pytorch integration with native API and DSL